Skip to content

Conversation

@twoertwein
Copy link
Member

@twoertwein twoertwein commented Apr 17, 2021

My best guess is that memory_map=True always assumed UTF-8 with the python engine. Now that the c and python engine use the same IO code, the c engine assumed UTF-8 as well.

@pep8speaks
Copy link

pep8speaks commented Apr 17, 2021

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-04-25 14:56:08 UTC

@twoertwein twoertwein added IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version labels Apr 17, 2021
@twoertwein twoertwein marked this pull request as draft April 17, 2021 02:26
@twoertwein twoertwein marked this pull request as ready for review April 25, 2021 14:20
def __iter__(self) -> _MMapWrapper:
return self

def read(self, size: int = -1) -> str | bytes:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function could be removed if the c-engine handled non-utf-8 better.

The PR in its current form will in case of the c-engine: 1) use mmap to read the entire file 2) decode it appropriately (this function) 3) the c-code will encode the now utf-8 string into bytes again. It would be more efficient if the c-engine supported non-utf-8 in more places. I will look into that, but that might take some time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob not worth the effort to handle non-utf-8 better, but if you want to look...ok

@jreback jreback added this to the 1.2.5 milestone Apr 26, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comments for future.

# memory mapping needs to be the first step
handle, memory_map, handles = _maybe_memory_map(
handle, memory_map, ioargs.encoding, ioargs.mode, errors
handle,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't object to passing by kwargs for easier reading

def __iter__(self) -> _MMapWrapper:
return self

def read(self, size: int = -1) -> str | bytes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prob not worth the effort to handle non-utf-8 better, but if you want to look...ok

@jreback jreback merged commit 0a0540c into pandas-dev:master Apr 26, 2021
@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

@meeseeksdev backport 1.2.x

@jreback
Copy link
Contributor

jreback commented Apr 26, 2021

thanks @twoertwein keep em coming!

@lumberbot-app

This comment has been minimized.

return newline

# IncrementalDecoder seems to push newline to the next line
return newline.lstrip("\n")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may leave \r at the end of the line when newline is CRLF, which is often used in windows.

Maybe try newline.lstrip("\n").rstrip("\r")?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! It might be worth adding tests to directly test the output of the mmap wrapper independent of read_csv.

I would assume that CRLF is covered by some of the Windows CI but it might be that the c-engine and python's csv are robust enough to ignore that :)

yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
simonjayhawkins pushed a commit that referenced this pull request May 25, 2021
Co-authored-by: phofl <patrick_hoefler@gmx.net>
@twoertwein twoertwein deleted the memory_map branch June 5, 2021 20:50
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

IO CSV read_csv, to_csv Regression Functionality that used to work in a prior pandas version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: read_csv is failing with an encoding different that UTF-8 and memory_map set to True in version 1.2.4

5 participants